ABSTRACT

Motivation: Proteins are essential macromolecules of life and thus
understanding their function is of great importance. The number of
functionally unclassified proteins is large even for simple and
well-studied organisms such as baker's yeast. Methods for determining
protein function have shifted their focus from targeting specific proteins
based solely on sequence homology to analyses of the entire
proteome based on protein-protein interaction (PPI) networks. Since
proteins aggregate to perform a certain function, analyzing structural
properties of PPI networks may provide useful clues about the biological
function of individual proteins, protein complexes they participate
in, and even larger subcellular machines.

Results: We design a sensitive graph theoretic method for comparing
local structures of node neighborhoods and use it to demonstrate that, in PPI
networks, the biological function of a node and its local network structure
are closely related. The method groups topologically similar proteins
under this measure in a PPI network and shows that these protein
groups belong to the same protein complexes, perform the same
biological functions, are localized in the same subcellular compartments,
and have the same tissue expressions. Moreover, we apply
our technique to proteome-scale network data and infer the biological
function of yet unclassified proteins, demonstrating that our method
can provide valuable guidelines for future experimental research.
Availability: Data is available upon request. 
Contact: natasha@ics.uci.edu 



1 INTRODUCTION 

Large amounts of biological network data are becoming available.
We study protein-protein interaction (PPI) networks (or
graphs), in which nodes correspond to proteins and undirected
edges represent physical interactions between them. Since a
protein almost never acts in isolation, but rather interacts with
other proteins in order to perform a certain function, PPI networks
by definition reflect the interconnected nature of biological
processes. Analyses of PPI networks may give valuable
insight into biological mechanisms and provide deeper understanding
of complex diseases. Defining the relationship between
the PPI network topology and biological function and inferring
protein function from it is one of the major challenges in

1.1 Background

Various approaches for determining protein function from PPI networks
have been proposed. "Neighborhood-oriented" approaches
observe the neighborhood of a protein to predict its function by
finding the most common function(s) among its neighbors. The
"majority rule" approach considers only nodes directly connected
to the protein of interest (Schwikowski and Fields, 2000).
An improvement is made by also observing indirectly connected
level-2 neighbors of a node (Chua et al., 2006). Furthermore,
the function with the highest χ² value amongst the functions of
all "n-neighboring proteins" is assigned to the protein of interest
(Hishigaki et al., 2001). Other approaches use the idea of shared
neighbors (Samanta and Liang, 2003) or the network flow-based
idea (Nabieva et al., 2005) to determine protein function.

Several global optimization-based function prediction strategies
have also been proposed. Any given assignment of functions to the
whole set of unclassified proteins in a network is given a score,
counting the number of interacting pairs of nodes with no common
function; the functional assignment with the lowest score maximizes
the presence of the same function among interacting proteins
(Vazquez et al., 2003). An approach that reduces the computational
requirements of this method has been proposed (Sun et al., 2006).

Cluster-based approaches exploit the existence of regions
in PPI networks that contain a large number of connections between
the constituent proteins. These dense regions are a sign of the
common involvement of those proteins in certain biological processes
and therefore are feasible candidates for biological complexes.
The restricted-neighborhood-search clustering algorithm efficiently
partitions a PPI network into clusters, identifying known and
predicting unknown protein complexes (King et al., 2004). Similarly,
highly connected subgraphs are used to identify clusters in networks
(Hartuv and Shamir, 2000), defining the relationship between the
PPI network size and the number and complexity of the identified
clusters, and identifying known protein complexes from these clusters
(Przulj et al., 2004). Moreover, Czekanowski-Dice distance is
used for protein function prediction by forming clusters of proteins
sharing a high percentage of interactions (Brun et al., 2004).

1.2 Approach

We address the above-mentioned challenge. First, we verify that in
PPI networks of yeast and human, local network structure and biological
function are closely related. We do this by designing a method
that clusters together nodes of a PPI network with similar topological
surroundings and by demonstrating that it successfully uncovers
groups of proteins belonging to the same protein complexes,
performing the same biological functions, being localized in the same
subcellular compartments, and having the same tissue expressions.
Since we verify this for PPI networks of a unicellular and a multicellular
eukaryotic organism (yeast and human, respectively), we
hypothesize that PPI network structure and biological function are
related in other eukaryotic organisms as well. Next, since the number
of functionally unclassified proteins is large even for simple and
well-studied organisms such as baker's yeast Saccharomyces cerevisiae
(Pena-Castillo and Hughes, 2007), we describe how to apply
our technique to predict membership in protein complexes, functional
groups, and subcellular compartments of yet unclassified yeast
proteins.

Our method belongs to the group of clustering-based approaches. 
However, compared to other methods that define a cluster as a dense 
interconnected region of a network, our method defines it as a set 
of nodes with similar topological signatures (defined below). Thus, 
nodes belonging to the same cluster do not need to be connected or 
belong to the same part of the network. 



2 METHODS 

Our new measure of node similarity generalizes the degree of a node, which
counts the number of edges that the node touches, into the vector of graphlet
degrees, counting the number of graphlets that the node touches; graphlets
are small connected non-isomorphic induced subgraphs of a large network
(Przulj et al., 2004) (see Figure 1). As opposed to partial subgraphs (e.g.,
network motifs (Milo et al., 2003)), graphlets must be induced, i.e., they
must contain all edges between the nodes of the subgraph that are present in
the large network. We count the number of graphlets touching a node for all
2-5-node graphlets, denoted by $G_0, G_1, \ldots, G_{29}$ in Figure 1; counts involving
larger graphlets become computationally infeasible for large networks.
Clearly, the degree of a node is the first coordinate of this vector, since an edge
(graphlet $G_0$) is the only 2-node graphlet. We call this vector the signature
of a node. It is topologically relevant to distinguish between nodes touching
a 3-node linear path (graphlet $G_1$) at an end or at the middle node; we provide
a mathematical formulation of this phenomenon for all graphlets with
2-5 nodes. This is summarized by automorphism orbits (or just orbits, for
brevity): by taking into account the "symmetries" between nodes of a graphlet,
there are 73 different orbits for 2-5-node graphlets, numbered from
0 to 72 in Figure 1 (see (Przulj, 2006) for details). Thus, the signature vector
of a node has 73 coordinates.
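
To make the notion of a signature concrete, the following minimal sketch counts only the first four coordinates (orbits 0 to 3, i.e., the 2- and 3-node graphlets) for every node. It assumes the usual orbit numbering in which orbit 1 is the end of an induced 3-node path, orbit 2 is its middle node, and orbit 3 is a triangle node; the function name and the adjacency-dictionary input format are illustrative choices, not part of the original method's code.

```python
from itertools import combinations


def small_orbit_counts(adj):
    """Count orbits 0-3 for every node of an undirected simple graph.

    adj: dict mapping each node to the set of its neighbours
         (hypothetical input format; no self-loops, symmetric).
    Returns a dict: node -> [orbit0, orbit1, orbit2, orbit3].
    """
    counts = {u: [0, 0, 0, 0] for u in adj}
    for u in adj:
        # Orbit 0: the degree of u (graphlet G0, a single edge).
        counts[u][0] = len(adj[u])
        # Every pair of neighbours of u gives either a triangle (orbit 3)
        # or an induced path with u as the middle node (orbit 2).
        for a, b in combinations(adj[u], 2):
            if b in adj[a]:
                counts[u][3] += 1
            else:
                counts[u][2] += 1
        # Orbit 1: u is an end node of an induced path u - a - b.
        for a in adj[u]:
            for b in adj[a]:
                if b != u and b not in adj[u]:
                    counts[u][1] += 1
    return counts
```

The full 73-coordinate signature follows the same idea for 4- and 5-node graphlets, at correspondingly higher cost.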

Fig. 1. The thirty 2-, 3-, 4-, and 5-node graphlets $G_0, G_1, \ldots, G_{29}$
and their automorphism orbits 0, 1, 2, ..., 72. In a graphlet $G_i$, $i \in \{0, 1, \ldots, 29\}$,
nodes belonging to the same orbit are of the same shade.

We compute node signature similarities as follows. We define a 73-dimensional
vector $W$ containing the weights $w_i$ corresponding to orbits
$i \in \{0, \ldots, 72\}$, where the weights are determined as follows. For each orbit,
we consider the number of orbits affecting it. For example, a difference in
orbit 0 (i.e., in the degree) of two nodes automatically implies differences
in all other orbits, since all orbits depend on it. Each orbit $i$ is assigned
an integer $o_i$ that represents the number of orbits that affect it (available upon
request). We consider that each orbit affects itself. We compute $w_i$ as a function
of $o_i$. We need to assign a higher weight $w_i$ to the orbits that are not
affected by many other orbits. Thus, we apply a slowly increasing logarithm
function to the $o_i$'s. Also, since the maximum value that an $o_i$ can take is 73 (for
2-5-node graphlets), we divide $\log_{10}(o_i)$ by $\log_{10}(73)$ to scale it to [0, 1].
Since an orbit dependency count $o_i$ of 1 indicates that no other orbits affect
orbit $i$ (i.e., this orbit is of the highest importance), we invert this scaled
value of orbit dependencies as

$$w_i = 1 - \frac{\log_{10}(o_i)}{\log_{10}(73)}$$

to assign the highest weight of 1 to orbit $i$ with $o_i = 1$. Clearly, $w_i \in [0, 1]$
for all $i \in \{0, \ldots, 72\}$, and orbits become less important as their
weights $w_i$ decrease.
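
As a small illustration of the weighting above, the sketch below evaluates the weight for a given orbit dependency count. The actual dependency counts are not listed in the text (they are available upon request), so any count passed in is a placeholder; the function name is ours. Because the weight is a ratio of logarithms, the base of the logarithm does not affect the result.

```python
import math

NUM_ORBITS = 73  # orbits 0..72 of the 2-5-node graphlets


def orbit_weight(o_i, num_orbits=NUM_ORBITS):
    """Weight of an orbit that is affected by o_i orbits (itself included)."""
    if not 1 <= o_i <= num_orbits:
        raise ValueError("dependency count must lie in [1, num_orbits]")
    # Scale log(o_i) to [0, 1], then invert so that o_i = 1 gets weight 1.
    return 1.0 - math.log(o_i) / math.log(num_orbits)
```

An orbit affected only by itself receives weight 1, and a hypothetical orbit affected by all 73 orbits would receive weight 0.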

For a node $u$, we denote by $u_i$ the $i$th coordinate of its signature vector,
i.e., $u_i$ is the number of times node $u$ touches orbit $i$. We define the distance
$D_i(u, v)$ between the $i$th orbits of nodes $u$ and $v$ as:

$$D_i(u, v) = w_i \times \frac{|\log_2(u_i + 1) - \log_2(v_i + 1)|}{\log_2(\max\{u_i, v_i\} + 2)}$$

We use $\log_2$ in the numerator because the $i$th coordinates of the signature vectors
of two nodes can differ by several orders of magnitude and we do not
want the distance measure to be entirely dominated by these large values.
Also, by using these logarithms, we take into account the relative difference
between $u_i$ and $v_i$ instead of the absolute difference. We add 1 to $u_i$ and $v_i$
in the numerator of the formula for $D_i(u, v)$ to prevent the logarithm function
from going to infinity. We scale $D_i$ to be in [0, 1] by dividing by the value of
the denominator in the formula for $D_i(u, v)$. We add 2 in the denominator
of the formula for $D_i(u, v)$ to prevent it from being infinite or 0. We find
the total distance $D(u, v)$ between nodes $u$ and $v$ as:

$$D(u, v) = \frac{\sum_{i=0}^{72} D_i(u, v)}{\sum_{i=0}^{72} w_i}$$

Clearly, the distance $D(u, v)$ is in [0, 1], where distance 0 means the identity
of signatures of nodes $u$ and $v$. Finally, the signature similarity, $S(u, v)$,
between nodes $u$ and $v$ is:

$$S(u, v) = 1 - D(u, v).$$
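
The distance and similarity definitions can be sketched in a few lines. This is an illustrative sketch only: it assumes that the total distance is the sum of the per-orbit distances normalized by the sum of the weights (which keeps it in [0, 1]), and the function names and list-based inputs are our own choices. It works for vectors of any common length, so short toy vectors can stand in for the 73-coordinate signatures.

```python
import math


def orbit_distance(u_i, v_i, w_i):
    """Weighted distance between the i-th signature coordinates of two nodes."""
    numerator = abs(math.log2(u_i + 1) - math.log2(v_i + 1))
    # max(...) + 2 >= 2, so the denominator is at least 1 and never zero.
    denominator = math.log2(max(u_i, v_i) + 2)
    return w_i * numerator / denominator


def signature_similarity(u_sig, v_sig, weights):
    """S(u, v) = 1 - D(u, v), with D assumed to be normalized by sum(weights)."""
    if not (len(u_sig) == len(v_sig) == len(weights)):
        raise ValueError("signatures and weights must have equal length")
    total = sum(orbit_distance(a, b, w)
                for a, b, w in zip(u_sig, v_sig, weights))
    return 1.0 - total / sum(weights)
```

Identical signatures give a similarity of exactly 1, and the measure is symmetric in its two arguments.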

For a node of interest, we form a cluster containing that node and all
nodes in a network that are similar to it. According to the signature similarity
metric, nodes $u$ and $v$ will be in the same cluster if their signature similarity
$S(u, v)$ is above a chosen threshold. We choose experimentally determined
thresholds in the range 0.9-0.95. For thresholds above these values, only a few
small clusters are obtained, especially for smaller PPI networks, indicating
too high a stringency in signature similarities. For thresholds below 0.9, the
clusters are very large, especially for larger PPI networks, indicating a loss
of signature similarity. To illustrate signature similarities and our choices of
signature similarity thresholds, in Figure 2 we present the signature vectors
of yeast proteins in the PPI network of (Krogan et al., 2006) with signature
similarities above 0.90 (Figure 2A) and below 0.40 (Figure 2B). Signature
vectors of proteins with high signature similarities follow the same pattern,
while those of proteins with low signature similarities have very different
patterns.
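
The cluster formation step for a single node of interest can be sketched as follows. The function name, the callable `similarity` argument, and the default threshold of 0.9 are illustrative assumptions; any function returning a signature similarity in [0, 1] could be plugged in.

```python
def form_cluster(node, nodes, similarity, threshold=0.9):
    """Cluster of `node`: the node itself plus all sufficiently similar nodes.

    nodes:      iterable of all nodes in the network
    similarity: callable (u, v) -> signature similarity in [0, 1]
    threshold:  similarity must be strictly above this value
    """
    cluster = {node}
    for v in nodes:
        if v != node and similarity(node, v) > threshold:
            cluster.add(v)
    return cluster
```

Note that clusters built this way are centered on one node, so two nodes in the same cluster need not be adjacent, and clusters of different nodes may overlap.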



3 RESULTS AND DISCUSSION 

We apply our method to six S. cerevisiae PPI networks and
three human PPI networks. The S. cerevisiae PPI networks are
henceforth denoted by "vonMering-core" (von Mering et al., 2002),
"vonMering" (von Mering et al., 2002), "Krogan" (Krogan et al., 2006),
"DIP-core" (Deane et al., 2002), "DIP" (Xenarios et al., 2002),
and "MIPS" (Mewes et al., 2002). "vonMering-core" contains
only high-confidence interactions described by von Mering et
al. (von Mering et al., 2002); it contains 2,455 interactions amongst
988 proteins obtained mainly by tandem affinity purification (TAP)
(Rigaut et al., 1999; Gavin et al., 2002) and High-Throughput Mass
Spectrometric Protein Complex Identification (HMS-PCI) (Ho et al.,
2002). "vonMering" is the PPI network containing the top 11,000
high-, medium-, and low-confidence interactions amongst 2,401
proteins described by von Mering et al. (von Mering et al., 2002);
the dominant techniques used to identify PPIs in this network
are TAP, HMS-PCI, gene neighborhood, and yeast-two-hybrid
(Y2H). "Krogan" is the "core" PPI data set containing 7,123
interactions amongst 2,708 proteins obtained by TAP experiments
as described by Krogan et al. (Krogan et al., 2006). "DIP-core"
is the more reliable subset of the yeast PPI network from DIP
(Xenarios et al., 2002) as described by Deane et al. (Deane et al.,
2002); it contains 5,174 interactions amongst 2,210 proteins. "DIP"
and "MIPS" are the yeast PPI networks downloaded in November
2007 from the DIP (Xenarios et al., 2002) and MIPS (Mewes et al.,
2002) databases, respectively; they contain 17,201 and 12,525
interactions amongst 4,932 and 4,786 proteins, respectively. The
three human PPI networks that we analyze are henceforth denoted
by "BIOGRID" (Stark et al., 2006), "HPRD" (Peri et al., 2004),
and "Rual" (Rual et al., 2005). "BIOGRID" and "HPRD" are the
human PPI networks downloaded in November 2007 from the
"BIOGRID" (Stark et al., 2006) and "HPRD" (Peri et al., 2004) databases,
respectively; they contain 23,555 and 34,119 interactions
amongst 7,941 and 9,182 proteins, respectively. "Rual" is the human
PPI network containing 3,463 interactions amongst 1,873 proteins,
as described by Rual et al. (Rual et al., 2005). We removed all
self-loops and multiple edges from each of the PPI networks that we
analyzed.
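
Removing self-loops and multiple edges from an interaction list is a simple preprocessing step; a minimal sketch is shown below. The function name and the pair-based input format are assumptions made for illustration, and the protein names in the usage note are placeholders rather than entries from any of the data sets.

```python
def simplify_edges(edges):
    """Return the set of undirected edges without self-loops or duplicates.

    edges: iterable of (protein_a, protein_b) pairs; the same interaction
           may appear several times and in either orientation.
    """
    simple = set()
    for a, b in edges:
        if a == b:
            continue  # drop self-loops (self-interactions)
        # A frozenset ignores orientation, so (a, b) and (b, a) collapse.
        simple.add(frozenset((a, b)))
    return simple
```

For instance, the list `[("A", "B"), ("B", "A"), ("A", "A"), ("B", "C")]` reduces to the two undirected edges A-B and B-C.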

The entire PPI network is taken into account when computing
signature similarities between pairs of nodes (i.e., proteins) and forming
clusters (see Section 2). However, here we only report the
results of analyzing proteins involved in more than four interactions.
We discard poorly connected proteins from our clusters because
they are more likely to be involved in noisy interactions. A similar
approach was taken by Brun et al. (Brun et al., 2004). Note that the highest
node degree in the analyzed PPI networks is 286. Also, we discard
very small clusters containing fewer than three proteins. For the
remaining clusters, we search for common protein properties: in
yeast PPI networks, we look for the common protein complexes,
functional groups, and subcellular localizations (described in MIPS
(Mewes et al., 2002)) of proteins belonging to the same cluster; in
human PPI networks, we look for the common biological processes,
cellular components, and tissue expressions (described in HPRD
(Peri et al., 2004)) of proteins in the same cluster.

Protein   Biological function categories
RPO26     11.02.01, 11.02.02, 11.02.03.01, 16.03
SMD1      11.04.03.01, 14.10, 16.03
SMB1      11.04.03.01

Fig. 3. An example of a three-node cluster, consisting of proteins RPO26,
SMD1, and SMB1. The categories of biological functions that the proteins
belong to are presented below the protein names.

Classification schemes and the data for the three protein properties
that we analyzed in yeast PPI networks were downloaded
from the MIPS database (Mewes et al., 2002) in November 2007. For
each of these three classification schemes (corresponding to protein
complexes, biological functions, and subcellular localizations),
we define two levels of strictness: the strict scheme uses the most
specific MIPS annotations, and the flexible one uses the least specific
ones. For example, for a protein complex "category" annotated
by 510.190.900 in MIPS, the strict scheme returns 510.190.900,
and the flexible one returns 510. Classification schemes and the
data for the three protein properties that we analyzed in human
PPI networks (corresponding to biological processes, cellular components,
and tissue expressions) were downloaded from the HPRD
database (Peri et al., 2004) in November 2007. In order to test
whether our method clusters together proteins having the same protein
properties, we refine our clusters by removing the nodes that are
not contained in any of the yeast MIPS protein complex, biological
function, or subcellular localization categories, or in any of
the human HPRD biological process, cellular component, or tissue
expression categories, respectively.

In our clusters, we measure the size of the largest common category
for a given protein property as the percentage of the cluster
size; we refer to it as the hit-rate. Clearly, a yeast protein can belong
to more than one protein complex, be involved in more than one
biological function, or belong to more than one subcellular compartment
(and the same holds for human proteins). Thus, it is possible
to have an overlap between categories, as well as more than one largest
category in a cluster for a given protein property. We illustrate
this for biological functions in the cluster presented in Figure 3, consisting
of yeast proteins RPO26, SMD1, and SMB1. According to
the strict scheme, protein SMD1 is in a common biological function
category with protein RPO26 (16.03), as well as with protein
SMB1 (11.04.03.01). Thus, there are two largest common biological
function categories. The size of the largest common biological
function category in the cluster is two and the hit-rate is 2/3=67%.
For the flexible scheme, all three proteins are in one common biological
function category (11) and thus, the size of the largest common
biological function category is three and the hit-rate is 3/3=100%.
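
The hit-rate computation for the three-protein example can be sketched as follows. The annotations are those listed for the cluster of Figure 3; the function names, the dictionary layout, and the way the flexible scheme is derived (keeping only the first dot-separated component of a category code) are our illustrative assumptions.

```python
from collections import Counter

# Biological function categories of the three-protein example cluster.
FIG3_CLUSTER = {
    "RPO26": {"11.02.01", "11.02.02", "11.02.03.01", "16.03"},
    "SMD1": {"11.04.03.01", "14.10", "16.03"},
    "SMB1": {"11.04.03.01"},
}


def strict(category):
    """Strict scheme: keep the most specific annotation unchanged."""
    return category


def flexible(category):
    """Flexible scheme: keep only the least specific (top-level) part."""
    return category.split(".")[0]


def hit_rate(cluster, scheme=strict):
    """Size of the largest common category as a fraction of the cluster size."""
    if not cluster:
        raise ValueError("cluster must not be empty")
    counts = Counter()
    for categories in cluster.values():
        # Map through the scheme first so each protein counts once per category.
        counts.update({scheme(c) for c in categories})
    return max(counts.values(), default=0) / len(cluster)
```

With these definitions, `hit_rate(FIG3_CLUSTER)` evaluates to 2/3 under the strict scheme and `hit_rate(FIG3_CLUSTER, flexible)` to 1.0 under the flexible scheme, matching the worked example.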

Fig. 2. Signature vectors of proteins with signature similarities: (A) above 0.90; and (B) below 0.40. The 73 orbits are presented on the abscissa and the
numbers of times that nodes touch a particular orbit are presented on the ordinate in log scale. In the interest of the aesthetics of the plot, we added 1 to all
orbit frequencies to prevent the log function from going to infinity in the case of orbit frequencies of 0.

We also define the miss-rate as the percentage of the nodes in
a cluster that are not in any common category with other nodes in
the cluster, for a given protein property. For example, in Figure 3,
according to the strict scheme, proteins RPO26 and SMB1 are in a
common biological function category with SMD1, but they themselves
are not in any common biological function category. Although
not all three proteins are in the same biological function category
and the hit-rate is only 67%, the miss-rate is 0/3=0%, since every
node is in at least one common biological function category with
another node in the cluster. Clearly, the miss-rate for the flexible
scheme is also 0/3=0%, since the three proteins are in the same biological
function category (11) with respect to this scheme. Thus, if
a protein belongs to several different categories for a given protein
property (which is expected), the hit-rate in the cluster might be
lower than 100% (as illustrated in Figure 3). Therefore, miss-rates
are additional indicators of the accuracy of our approach.
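
The miss-rate can be sketched in the same style. As before, the function name, the dictionary layout, and the optional `scheme` argument are illustrative assumptions; the example annotations are those of the three-protein cluster discussed above, and the extra protein used in the usage note is hypothetical.

```python
def miss_rate(cluster, scheme=lambda category: category):
    """Fraction of proteins sharing no category with any other cluster member.

    cluster: dict mapping protein name -> set of category codes
    scheme:  callable mapping a category code to the level being compared
    """
    if not cluster:
        raise ValueError("cluster must not be empty")
    mapped = {p: {scheme(c) for c in cats} for p, cats in cluster.items()}
    misses = 0
    for protein, cats in mapped.items():
        # A protein is a "miss" if its categories are disjoint from
        # those of every other protein in the cluster.
        shares = any(cats & other
                     for q, other in mapped.items() if q != protein)
        if not shares:
            misses += 1
    return misses / len(mapped)


EXAMPLE_CLUSTER = {
    "RPO26": {"11.02.01", "11.02.02", "11.02.03.01", "16.03"},
    "SMD1": {"11.04.03.01", "14.10", "16.03"},
    "SMB1": {"11.04.03.01"},
}
```

For `EXAMPLE_CLUSTER` the miss-rate is 0, and adding a hypothetical fourth protein annotated only with an unrelated category would raise it to 1/4.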

For each of the six yeast PPI networks, the three yeast protein
properties, and the two schemes, we measure the number of clusters
(out of the total number of clusters in a network) having given hit- and miss-rates.
We bin the hit- and miss-rates in increments of 10%.
The results for the flexible scheme are presented in Figure 4. For
subcellular localizations, in the vonMering-core network, 86% of the
clusters have hit-rates above 90%; for the remaining five networks,
65% of clusters have hit-rates above 60% (Figure 4A). For all networks,
miss-rates for 72% of clusters are below 10% (Figure 4B).
Similarly, for biological functions, the miss-rates in all six networks
are under 10% for 81% of the clusters (Figure 4D). The hit-rates for
biological functions are above 60% for 79% of the clusters in both
von Mering networks; in the remaining four networks, 57% of the
clusters have hit-rates above 50% (Figure 4C). Finally, for protein
complexes, 47% of clusters in the vonMering-core, vonMering, and DIP-core
networks have hit-rates above 60%, 36% of clusters in the Krogan
and MIPS networks have hit-rates above 50%, and 30% of clusters
in the DIP network have hit-rates above 40% (Figure 4E). Miss-rates
for protein complexes are below 10% for 39% of the clusters in
both von Mering networks and in the DIP-core network; in the remaining
three networks, 33% of the clusters have miss-rates below 39%
(Figure 4F).
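
The binning of hit- and miss-rates into 10% increments, and summary statements of the form "x% of clusters have hit-rates above y%", can be sketched as below. The function names are ours, and the convention that a rate of exactly 100% falls into the top bin is an assumption; the rates in the usage note are made-up values, not results from the networks.

```python
def bin_rates(rates, n_bins=10):
    """Histogram of rates (fractions in [0, 1]) using equal-width bins.

    A rate of exactly 1.0 is placed in the last bin (assumed convention).
    """
    bins = [0] * n_bins
    for rate in rates:
        index = min(int(rate * n_bins), n_bins - 1)
        bins[index] += 1
    return bins


def share_above(rates, cutoff):
    """Fraction of clusters whose rate is strictly above the cutoff."""
    if not rates:
        raise ValueError("rates must not be empty")
    return sum(rate > cutoff for rate in rates) / len(rates)
```

For hypothetical hit-rates `[1.0, 0.95, 0.67, 0.05]`, two of the four clusters fall into the top bin, so `share_above(rates, 0.9)` is 0.5.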

Similarly, for each of the three human PPI networks and their
three protein properties that we analyzed, we measure the number
of clusters (out of the total number of clusters in a network) having
given hit- and miss-rates. The results are presented in Figure 5. For
cellular components, in all three human PPI networks, 86% of the
clusters have hit-rates above 50% (Figure 5A). Miss-rates for 68%
of clusters in the BIOGRID and HPRD networks are below 10%, while
in the Rual network 76% of clusters have miss-rates below 29% (Figure
5B). Similarly, for tissue expressions, hit-rates are above 50% for
74% of clusters in the BIOGRID and HPRD networks, and for 98% of
clusters in the Rual network (Figure 5C). Miss-rates are
lower than 10% for 61% of clusters in the BIOGRID and HPRD networks,
and for 48% of clusters in the Rual network (Figure
5D). Finally, for biological processes, hit-rates are above 50% for
55% of clusters in the BIOGRID network, for 45% of clusters in the HPRD
network, and for 33% of clusters in the Rual network
(Figure 5E). Miss-rates are below 29% for 58% of the clusters in the
BIOGRID network and for 71% of the clusters in the HPRD network;
in the Rual network, 44% of the clusters have miss-rates below 39%
(Figure 5F).

To evaluate the effect of noise in PPI networks on the accuracy
of our method, we compare the results for the high-confidence
vonMering-core network and the lower-confidence vonMering network
(Figure 4). As expected, clusters in the noisier network
have lower hit-rates compared to the high-confidence network.
However, low miss-rates are still preserved in clusters of both networks
for all three protein properties, indicating the robustness of
our method to noise present in PPI networks.

Fig. 4. The results of applying our method to the six yeast PPI networks (vonMering-core, vonMering, Krogan, DIP-core, DIP, and MIPS) and the three
protein properties (subcellular localizations, biological functions, and protein complexes) in accordance with the flexible scheme: (A) hit-rates for subcellular
localizations; (B) miss-rates for subcellular localizations; (C) hit-rates for biological functions; (D) miss-rates for biological functions; (E) hit-rates for protein
complexes; (F) miss-rates for protein complexes.

Thus far, we demonstrated that our method identifies groups of
nodes in PPI networks having common protein properties. Our
technique can also be applied to predict protein properties of yet
unclassified proteins by forming a cluster of proteins that are similar
to the unclassified protein of interest and assigning it the most
common properties of the classified proteins in the cluster. We do
this for all 115 functionally unclassified yeast proteins from MIPS
that have degrees higher than four in any of the six yeast PPI networks
that we analyzed. In Tables 1 and 2, we present the predicted
functions for proteins with prediction hit-rates of 50% or higher
according to the strict and the flexible scheme, respectively. The
full data set, including functional predictions with hit-rates lower than 50%,
is available upon request. Note that a yeast protein can belong to
more than one yeast PPI network that we analyzed. Thus, biological
functions that such proteins perform can be predicted from clusters
derived from different yeast PPI networks. We observed an overlap
of the predicted protein functions obtained from multiple PPI networks
for the same organism, additionally verifying the correctness
of our method. Furthermore, there exists an overlap between our protein
function predictions and those of others (Vazquez et al., 2003).
Finally, we successfully predict the functional category of the PWP1
protein, which is still functionally uncharacterized in MIPS but is characterized
in SGD (Cherry et al., 1998) as being involved in rRNA
processing.
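
The prediction step can be sketched as a majority vote within the cluster of the protein of interest. This is a simplified illustration under stated assumptions: the function name and input layout are ours, unclassified proteins are represented by empty annotation sets, and the hit-rate is taken relative to the total number of proteins in the cluster, as described for the tables. The protein and category names in the usage note are placeholders.

```python
from collections import Counter


def predict_functions(cluster, min_hit_rate=0.5):
    """Functions shared by at least `min_hit_rate` of the cluster's proteins.

    cluster: dict mapping protein name -> set of known function categories
             (an empty set marks a functionally unclassified protein,
             including the protein of interest itself).
    Returns a dict: predicted function -> its hit-rate in the cluster.
    """
    if not cluster:
        raise ValueError("cluster must not be empty")
    counts = Counter()
    for functions in cluster.values():
        counts.update(functions)
    size = len(cluster)  # total cluster size, unclassified proteins included
    return {function: n / size
            for function, n in counts.items()
            if n / size >= min_hit_rate}
```

For a hypothetical cluster `{"query": set(), "P1": {"11.04"}, "P2": {"11.04", "16.03"}, "P3": {"14.10"}}`, only category 11.04 reaches the 50% cutoff and would be assigned to the query protein.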

To our knowledge, this is the first study that relates the PPI network
structure to all of the following: protein complexes, biological
functions, and subcellular localizations for yeast, and cellular components,
tissue expressions, and biological processes for human.

Fig. 5. The results of applying our method to the three human PPI networks (BIOGRID, HPRD, and Rual) and the three protein properties (cellular components,
tissue expressions, and biological processes): (A) hit-rates for cellular components; (B) miss-rates for cellular components; (C) hit-rates for tissue
expressions; (D) miss-rates for tissue expressions; (E) hit-rates for biological processes; (F) miss-rates for biological processes.



Starting with the topology of PPI networks of different organisms
that are of different sizes and originate from a wide spectrum
of small-scale and high-throughput PPI detection techniques, our
method identifies clusters of nodes sharing common protein properties.
Our method accurately uncovers groups of nodes belonging to
the same protein complexes in the vonMering-core network: 44%
of clusters have a 100% hit-rate according to the flexible scheme.
This additionally validates our method, since PPIs in this network
are obtained mainly by TAP (Rigaut et al., 1999; Gavin et al., 2002)
and HMS-PCI (Ho et al., 2002), which are known to favor protein
complexes.

Our node similarity measure is highly constraining, since we take
into account not only a node's degree, but also 72 additional "graphlet
degrees" (see Section 2). Since the number of graphlets on n
nodes increases exponentially with n, we use 2-5-node graphlets
(see Figure 1). Our method is easily extensible to include
larger graphlets, but this would increase the computational complexity;
the complexity is currently $O(|V|^5)$ for a graph $G(V, E)$, since
we search for graphlets with up to 5 nodes. Nonetheless, since our
algorithm is "embarrassingly parallel" (i.e., can easily be distributed
over a cluster of machines), extending it to larger graphlets is feasible.
In addition to the design of the signature similarity measure
as a number in [0, 1], this makes our technique usable for larger
networks.

Table 1. Predicted functions with prediction hit-rates of 50% or higher according to the strict scheme for yeast proteins that are unannotated in MIPS and
that have degrees higher than four in any of the six yeast PPI networks. The column denoted by "Protein of interest" contains a protein of interest for which
the function is predicted. The column denoted by "Degree" contains the degree of a given protein in the corresponding PPI network. The column denoted
by "PPI Network" contains the PPI network from which the protein function was derived. The column denoted by "Number of proteins in cluster" contains
the total number of proteins in the cluster, including the protein of interest. The column denoted by "Number of unclassified proteins in cluster" contains the
number of functionally unclassified proteins in a given cluster, including the protein of interest. The column denoted by "Majority (and predicted) function"
contains the common functions amongst at least 50% of the proteins in the cluster that are also predicted functions for the protein of interest. The column denoted
by "Number of proteins in cluster with the majority function" contains the number of nodes in the cluster with the majority function. The column denoted by
"Hit-rate" contains the percentage of the total number of proteins in the cluster with the majority function; only the maximum hit-rate is reported for a protein
of interest. Finally, the column denoted by "Miss-rate" contains the percentage of annotated nodes in the cluster that do not have a common function with any
other annotated node in the cluster.



4 CONCLUSION 

We present a new graph theoretic method for detecting the relationship
between local topology and function in real-world networks.
We apply it to proteome-scale PPI networks and demonstrate the
link between the topology of a protein's neighborhood in the network
and its membership in protein complexes, functional groups,
and subcellular compartments for yeast, and in cellular components,
tissue expressions, and biological processes for human. Additionally,
we demonstrate that our method can be used to predict the biological
function of uncharacterized proteins. Moreover, the method
can be applied to different types of biological and other real-world
networks and give insight into complex biological mechanisms and
guidelines for future experimental research.

Table 2. Predicted functions with prediction hit-rates higher than 50% according to the flexible scheme for yeast proteins that are unannotated in MIPS and
that have degrees higher than four in any of the six yeast PPI networks. The columns have the same meaning as in Table 1.